Abstract

The recently proposed neural network joint model (NNJM) (Devlin et al., 2014) augments the
n-gram target language model with a heuristically chosen source context window, achieving
state-of-the-art performance in SMT. In this paper, we give a more systematic treatment by
summarizing the relevant source information through a convolutional architecture guided by
the target information. With different guiding signals during decoding, our specifically
designed convolution+gating architectures can pinpoint the parts of a source sentence that are
relevant to predicting a target word, and fuse them with the context of the entire source sentence
to form a unified representation. This representation, together with target language words, is
fed to a deep neural network (DNN) to form a stronger NNJM. Experiments on two NIST
Chinese-English translation tasks show that the proposed model can achieve significant
improvements over the previous NNJM by up to +1.08 BLEU points on average.


1 Introduction 

Learning of continuous space representation for source language has attracted much attention in both
traditional statistical machine translation (SMT) and neural machine translation (NMT). Various
models, mostly neural network-based, have been proposed for representing the source sentence, mainly
as the encoder part in an encoder-decoder framework (Bengio et al., 2003; Auli et al., 2013;
Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). There has been some quite
recent work on encoding only the “relevant” part of the source sentence during the decoding process,
most notably the neural network joint model (NNJM) of Devlin et al. (2014), which extends the n-gram
target language model by additionally taking a fixed-length window of the source sentence, achieving
state-of-the-art performance in statistical machine translation.

In this paper, we propose novel convolutional architectures to dynamically encode the relevant 
information in the source language. Our model covers the entire source sentence, but can effectively 
find and properly summarize the relevant parts, guided by the information from the target language. 
With the guiding signals during decoding, our specifically designed convolution architectures can 
pinpoint the parts of a source sentence that are relevant to predicting a target word, and fuse them with 
the context of the entire source sentence to form a unified representation. This representation,
together with target words, is fed to a deep neural network (DNN) to form a stronger NNJM. Since our
proposed joint model is purely lexicalized, it can be integrated into any SMT decoder as a feature. 

Two variants of the joint model are also proposed, named tagCNN and inCNN, which use different
guiding signals from the decoding process. We integrate the proposed joint models into a
state-of-the-art dependency-to-string translation system (Xie et al., 2011) to evaluate their
effectiveness. Experiments on NIST Chinese-English translation tasks show that our model is able to
achieve significant improvements of +2.0 BLEU points on average over the baseline. Our model also
outperforms Devlin et al. (2014)’s NNJM by up to +1.08 BLEU points.

Figure 1: Illustration for joint LM based on CNN encoder.


RoadMap: In the remainder of this paper, we start with a brief overview of the joint language model
in Section 2, while the convolutional encoders, as its key component, will be described in
detail in Section 3. Then in Section 4 we discuss the decoding algorithm with the proposed models.
The experiment results are reported in Section 5, followed by Sections 6 and 7 for related work and
conclusion.


2 Joint Language Model 

Our joint model with CNN encoders is illustrated in Figure 1 (a) & (b). It consists of 1) a CNN
encoder, namely tagCNN or inCNN, to represent the information in the source sentences, and 2)
an NN-based model for predicting the next words, with representations from CNN encoders and the
history words in the target sentence as inputs.

In the joint language model, the probability of the target word $e_n$, given the previous $k$ target
words $\{e_{n-k}, \cdots, e_{n-1}\}$ and the representations from CNN encoders for source sentence
$S$, is

tagCNN: $p(e_n \mid \phi_1(S, \{a(e_n)\}), \{e\}_{n-k}^{n-1})$

inCNN: $p(e_n \mid \phi_2(S, h(\{e\}_{n-k}^{n-1})), \{e\}_{n-k}^{n-1})$,

where $\phi_1(S, \{a(e_n)\})$ stands for the representation given by tagCNN with the set of indexes
$\{a(e_n)\}$ of source words aligned to the target word $e_n$, and $\phi_2(S, h(\{e\}_{n-k}^{n-1}))$
stands for the representation from inCNN with the attention signal $h(\{e\}_{n-k}^{n-1})$.

Let us use the example in Figure 1, where the task is to translate the Chinese sentence

Chinese: 智利 举行 国会 与 总统 选举

Pinyin: Zhili Juxing Guohui Yu Zongtong Xuanju

into English. In evaluating a target language sequence “holds parliament and
presidential”, with “holds parliament and” as the preceding words (assume 4-gram
LM), and the affiliated source word¹ of “presidential” being “Zongtong” (determined by
word alignment), tagCNN generates $\phi_1(S, \{4\})$ (the index of “Zongtong” is 4), and inCNN
generates $\phi_2(S, h(\text{holds parliament and}))$. The DNN component then takes “holds
parliament and” and ($\phi_1$ or $\phi_2$) as input to give the conditional probability for the
next word, e.g., $p(\text{“presidential”} \mid \phi_{1|2}, \{\text{holds}, \text{parliament}, \text{and}\})$.

3 Convolutional Models 

We start with the generic architecture for convolutional encoder, and then proceed to tagCNN and 
inCNN as two extensions. 


3.1 Generic CNN Encoder 

The basic architecture of a generic CNN encoder is illustrated in Figure 2(a), which has a fixed
architecture consisting of six layers:

Layer-0: the input layer, which takes words in the form of embedding vectors. In our work, we set the
maximum length of sentences to 40 words. For sentences shorter than that, we put zero padding
at the beginning of sentences.

Layer-1: a convolution layer after Layer-0, with window size = 3. As will be discussed in Sections 3.2
and 3.3, the guiding signals are injected into this layer for the “guided version”.

Layer-2: a local gating layer after Layer-1, which simply takes a weighted sum over feature-maps in
non-overlapping windows with size = 2.

Layer-3: a convolution layer after Layer-2, where we perform another convolution with window size = 3.

Layer-4: a global gating layer, where we perform a global gating over feature-maps on Layer-3.

Layer-5: fully connected weights that map the output of Layer-4 to this layer as the final
representation.
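To make the six-layer layout concrete, the following sketch traces how many positions survive each layer and shows the zero padding at the beginning of short sentences. It assumes stride-1 convolutions with no extra padding and floor division for the non-overlapping gating windows; these choices, and the names `trace_lengths` and `left_pad`, are illustrative assumptions rather than details stated in the text.

```python
def trace_lengths(max_len=40, conv_window=3, gate_window=2):
    """Return the number of positions kept after Layer-0 .. Layer-5."""
    layer0 = max_len                     # padded input words
    layer1 = layer0 - conv_window + 1    # convolution, stride 1, no padding (assumption)
    layer2 = layer1 // gate_window       # local gating over non-overlapping windows
    layer3 = layer2 - conv_window + 1    # second convolution (same assumption)
    layer4 = 1                           # global gating fuses all positions
    layer5 = 1                           # fully connected layer keeps one vector
    return [layer0, layer1, layer2, layer3, layer4, layer5]


def left_pad(embeddings, max_len, dim):
    """Put zero vectors at the beginning of a sentence shorter than max_len."""
    padding = [[0.0] * dim for _ in range(max_len - len(embeddings))]
    return padding + embeddings


if __name__ == "__main__":
    print(trace_lengths())
    print(left_pad([[1.0, 2.0]], max_len=3, dim=2))
```

Under these assumptions a 40-word input shrinks to a single fixed-size vector regardless of the original sentence length, which is what lets the DNN consume it as an ordinary input.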


3.1.1 Convolution 

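As shown in Figure 2(a), the convolution in Layer-1 operates on sliding windows of words (width $k_1$),
and the similar definition of windows carries over to higher layers. Formally, for source sentence input
$\mathbf{x} = \{\mathbf{x}_1, \cdots, \mathbf{x}_N\}$, the convolution unit for feature map of type-$f$
(among $F_\ell$ of them) on Layer-$\ell$ is

$$z_i^{(\ell,f)}(\mathbf{x}) = \sigma\big(\mathbf{w}^{(\ell,f)} \hat{\mathbf{z}}_i^{(\ell-1)} + b^{(\ell,f)}\big), \quad \ell = 1, 3, \quad f = 1, 2, \cdots, F_\ell \qquad (1)$$

where

• $z_i^{(\ell,f)}(\mathbf{x})$ gives the output of feature map of type-$f$ for location $i$ in Layer-$\ell$;

• $\mathbf{w}^{(\ell,f)}$ and $b^{(\ell,f)}$ are the parameters for $f$ on Layer-$\ell$;

• $\sigma(\cdot)$ is the Sigmoid activation function;

• $\hat{\mathbf{z}}_i^{(\ell-1)}$ denotes the segment of Layer-$\ell{-}1$ for the convolution at location $i$, while

$$\hat{\mathbf{z}}_i^{(0)} = [\mathbf{x}_i^\top, \mathbf{x}_{i+1}^\top, \mathbf{x}_{i+2}^\top]^\top$$

concatenates the vectors for 3 words from sentence input $\mathbf{x}$.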
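The convolution unit can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: each window of three adjacent vectors is concatenated into one segment, and every feature map applies a sigmoid to a weighted sum of that segment. The function names, the nested-list layout, and the explicit bias argument are assumptions made for the example.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def convolve(inputs, weights, biases, window=3):
    """Apply F feature maps to every window of `window` adjacent vectors.

    inputs  : list of vectors, one per location (word embeddings on Layer-0)
    weights : F rows, each of length window * dim
    biases  : F scalars
    returns : list over locations, each an F-dimensional feature vector
    """
    outputs = []
    for i in range(len(inputs) - window + 1):
        # Concatenate the vectors at locations i, i+1, i+2 into one segment.
        segment = [v for vec in inputs[i:i + window] for v in vec]
        outputs.append([
            sigmoid(sum(w * s for w, s in zip(row, segment)) + b)
            for row, b in zip(weights, biases)
        ])
    return outputs


if __name__ == "__main__":
    words = [[0.0], [0.0], [0.0], [0.0]]      # four 1-dimensional "embeddings"
    print(convolve(words, weights=[[1.0, 1.0, 1.0]], biases=[0.0]))
```

The same function can be reused for Layer-3 by passing the Layer-2 outputs as `inputs`.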


¹For an aligned target word, we take its aligned source words as its affiliated source words. For an
unaligned word, we inherit its affiliation from the closest aligned word, with preference given to the
right (Devlin et al., 2014). Since the word alignment is many-to-many, one target word may have
multiple affiliated source words.








Figure 2: Illustration for the CNN encoders. Panels: (a) the generic architecture for CNN encoder;
(c) the convolution for inCNN.


3.1.2 Gating 

Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take
a straightforward convolution-pooling strategy, in which the “fusion” decisions (e.g., selecting the
largest one in max-pooling) are based on the values of feature-maps. This is essentially a soft
template matching, which works for tasks like classification, but is harmful for keeping the
composition functionality of convolution, which is critical for modeling sentences. In this paper,
we propose to use a separate gating unit to relieve the convolution of the scoring duty, and let it
focus on composition.

We take two types of gating: 1) for Layer-2, we take a local gating with non-overlapping windows 
(size = 2) on the feature-maps of convolutional Layer-1 for representation of segments, and 2) for 
Layer-4, we take a global gating to fuse all the segments for a global representation. We found that 
this gating strategy can considerably improve the performance of both tagCNN and inCNN over 
pooling. 


• Local Gating: On Layer-1, for every gating window, we first find its original input (before
convolution) on Layer-0, and merge them for the input of the gating network. For example,
for the two windows word (3,4,5) and word (4,5,6) on Layer-0, we use the concatenated vector
consisting of the embeddings for word (3,4,5,6) as the input of the local gating network (a logistic
regression model) to determine the weight for the convolution result of the two windows (on
Layer-1), and the weighted sum is the output of Layer-2.

• Global Gating: On Layer-3, for feature-maps at each location $i$, denoted $\mathbf{z}_i^{(3)}$, the
global gating network (essentially soft-max, parameterized by $\mathbf{w}_g$) assigns a normalized weight

$$\omega(\mathbf{z}_i^{(3)}) = \frac{e^{\mathbf{w}_g^\top \mathbf{z}_i^{(3)}}}{\sum_j e^{\mathbf{w}_g^\top \mathbf{z}_j^{(3)}}},$$

and the gated representation on Layer-4 is given by the weighted sum
$\sum_i \omega(\mathbf{z}_i^{(3)}) \mathbf{z}_i^{(3)}$.
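The two gating steps can be illustrated with a small sketch. The local gate below is a logistic regression over the concatenated Layer-0 input; using its output g for the first window and 1 - g for the second is an assumption, since the text only says that the gate determines the weight for the two convolution results. The global gate is a soft-max over locations followed by a weighted sum. All function and parameter names are hypothetical.

```python
import math


def local_gate(conv_a, conv_b, gate_input, gate_weights, gate_bias=0.0):
    """Blend two adjacent convolution outputs with a logistic-regression gate."""
    score = sum(w * x for w, x in zip(gate_weights, gate_input)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-score))
    # Assumption: g weights the first window, (1 - g) the second.
    return [g * a + (1.0 - g) * b for a, b in zip(conv_a, conv_b)]


def global_gate(feature_maps, w_g):
    """Soft-max weights over locations, then the weighted sum of feature maps."""
    scores = [sum(w * z for w, z in zip(w_g, fmap)) for fmap in feature_maps]
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(feature_maps[0])
    fused = [sum(wt * fmap[d] for wt, fmap in zip(weights, feature_maps))
             for d in range(dim)]
    return weights, fused


if __name__ == "__main__":
    print(local_gate([1.0], [3.0], gate_input=[0.0], gate_weights=[0.0]))
    print(global_gate([[1.0, 0.0], [0.0, 1.0]], w_g=[0.0, 0.0]))
```

Unlike max-pooling, neither gate selects by the magnitude of the feature values themselves, so the convolution output is free to act as a composed representation.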


3.1.3 Training of CNN encoders 

The CNN encoders, including tagCNN and inCNN that will be discussed right below, are trained in
a joint language model described in Section 2, along with the following parameters:

• the embedding of the words on source and the preceding words on target;

• the parameters for the DNN of the joint language model, including the parameters of soft-max for
word probability.

The training procedure is identical to that of the neural network language model, except that a parallel
corpus is used instead of a monolingual corpus. We seek to maximize the log-likelihood of training
samples, with one sample for every target word in the parallel corpus. Optimization is performed with
the conventional back-propagation, implemented as stochastic gradient descent (LeCun et al., 1998)
with mini-batches.
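Written out, the training criterion described above can be sketched as follows. The symbols $\mathcal{D}$ (the parallel corpus), $\theta$ (all trainable parameters) and $\phi$ (the output of either CNN encoder with its guiding signal $g_n$) are notation introduced here for illustration and do not appear in this form in the text.

```latex
\mathcal{L}(\theta) \;=\; \sum_{(S, E) \in \mathcal{D}} \; \sum_{n=1}^{|E|}
  \log p\big(e_n \,\big|\, \phi(S, g_n),\, \{e\}_{n-k}^{n-1};\, \theta\big)
```

Each inner term corresponds to one training sample, that is, one target word $e_n$ of a sentence pair $(S, E)$, and stochastic gradient descent with mini-batches ascends this sum.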


3.2 tagCNN

tagCNN inherits the convolution and gating from the generic CNN (as described in Section 3.1), with the
only modification in the input layer. As shown in Figure 2(b), in tagCNN, we append an extra tagging
bit (0 or 1) to the embedding of words in the input layer to indicate whether it is one of the
affiliated words:

$$\mathbf{x}_i^{(\mathrm{AFF})} = [\mathbf{x}_i^\top \; 1]^\top, \qquad \mathbf{x}_j^{(\mathrm{NON\text{-}AFF})} = [\mathbf{x}_j^\top \; 0]^\top.$$

Those extended word embeddings will then be treated as regular word embeddings in the convolutional
neural network. This particular encoding strategy can be extended to embed more complicated
dependency relations in the source language, as will be described in Section 5.4.

This particular “tag” will be activated in a parameterized way during the training for predicting the
target words. In other words, the supervised signal from the words to predict will find, through layers
of back-propagation, the importance of the tag bit in the “affiliated words” in the source language,
and learn to put proper weight on it to make tagged words stand out and adjust other parameters in
tagCNN accordingly for the optimal predictive performance. In doing so, the joint model can pinpoint
the parts of a source sentence that are relevant to predicting a target word through the already learned
word alignment.
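The input-layer modification of tagCNN amounts to one extra dimension per word. A minimal sketch, with the function name `tag_embeddings` and the list-based layout chosen only for illustration:

```python
def tag_embeddings(embeddings, affiliated_indexes):
    """Append a tagging bit to each source word embedding.

    The bit is 1.0 for affiliated source words and 0.0 for all others.
    """
    affiliated = set(affiliated_indexes)
    return [vec + [1.0 if i in affiliated else 0.0]
            for i, vec in enumerate(embeddings)]


if __name__ == "__main__":
    source = [[0.1], [0.2], [0.3]]          # three 1-dimensional embeddings
    print(tag_embeddings(source, {1}))      # word 1 is the affiliated word
```

Because the bit is just another input dimension, the weights attached to it are learned by back-propagation together with everything else in the encoder.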


3.3 inCNN 


Unlike tagCNN, which directly tells the location of affiliated words to the CNN encoder, inCNN
sends the information about the preceding words on the target side to the convolutional encoder to help
retrieve the information relevant for predicting the next word. This is essentially a particular case of
the attention model, analogous to the automatic alignment mechanism in (Bahdanau et al., 2014), where
the attention signal is from the state of a generative recurrent neural network (RNN) as decoder.

Basically, the information from preceding words, denoted as $h(\{e\}_{n-k}^{n-1})$, is injected into every
convolution window in the source language sentence, as illustrated in Figure 2(c). More specifically,
for the window indexed by $t$, the input to convolution is given by the concatenated vector

$$\hat{\mathbf{z}}_t = [h(\{e\}_{n-k}^{n-1}), \mathbf{x}_t^\top, \mathbf{x}_{t+1}^\top, \mathbf{x}_{t+2}^\top]^\top.$$

In this work, we use a DNN to transform the vector concatenated from word embeddings for words
$\{e_{n-k}, \cdots, e_{n-1}\}$ into $h(\{e\}_{n-k}^{n-1})$, with sigmoid activation function. Through
layers of convolution and gating, inCNN can 1) retrieve the relevant segments of source sentences, and
2) compose and transform the retrieved segments into a representation recognizable by the DNN in
predicting the words in the target language. Different from tagCNN, inCNN uses information from the
preceding words, and hence provides information complementary to that of tagCNN in the augmented
joint language model. This has been empirically verified: using the feature based on tagCNN together
with that based on inCNN in decoding yields greater improvement.
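The injection of the attention signal into each convolution window can be sketched as a simple concatenation. The code below assumes the signal has already been computed from the preceding target words; `incnn_windows` and its arguments are hypothetical names used only for this illustration.

```python
def incnn_windows(signal, embeddings, window=3):
    """Build the convolution input for every source window.

    signal     : the attention signal computed from the preceding target words
    embeddings : source word embeddings, one vector per word
    returns    : for each window t, the signal followed by the embeddings of
                 words t, t+1, t+2 flattened into one vector
    """
    inputs = []
    for t in range(len(embeddings) - window + 1):
        words = [v for vec in embeddings[t:t + window] for v in vec]
        inputs.append(list(signal) + words)
    return inputs


if __name__ == "__main__":
    source = [[1.0], [2.0], [3.0], [4.0]]
    print(incnn_windows([9.0], source))
```

Every window sees the same signal, so the first convolution layer can learn which source segments matter for the current target history.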
Figure 3: Illustration for a dependency tree (a) with three head-dependents relations in shadow, an
example of head-dependents relation rule (b) for the top level of (a), and an example of head rule (c).
“X1:NN” indicates a substitution site that can be replaced by a subtree whose root has part-of-speech
“NN”. The underline denotes a leaf node.


4 Decoding with the Joint Model 


Our joint model is purely lexicalized, and therefore can be integrated into any SMT decoder as a
feature. For a hierarchical SMT decoder, we adopt the integration method proposed by Devlin et
al. (2014). As inherited from the n-gram language model for performing hierarchical decoding, the
leftmost and rightmost n − 1 words from each constituent should be stored in the state space. We
extend the state space to also include the indexes of the affiliated source words for each of these edge
words. For an aligned target word, we take its aligned source words as its affiliated source words.
For an unaligned word, we use the affiliation heuristic adopted by Devlin et al. (2014). In this
paper, we integrate the joint model into the state-of-the-art dependency-to-string machine translation
decoder as a case study to test the efficacy of our proposed approaches. We briefly describe the
dependency-to-string translation model and then the MT system.
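The affiliation rule can be sketched as follows. The sketch follows the description in this paper (aligned words keep their aligned source words; unaligned words inherit from the closest aligned target word, preferring the right side on ties). The alignment format, a collection of (source index, target index) pairs, and the function name are assumptions made for the example.

```python
def affiliated_source_words(num_target, alignment):
    """Return, for each target position, the sorted affiliated source indexes."""
    aligned = {}
    for source_index, target_index in alignment:
        aligned.setdefault(target_index, set()).add(source_index)

    result = []
    for t in range(num_target):
        if t in aligned:
            result.append(sorted(aligned[t]))
            continue
        found = []
        # Search outward; at equal distance the right neighbour is tried first.
        for distance in range(1, num_target):
            for candidate in (t + distance, t - distance):
                if 0 <= candidate < num_target and candidate in aligned:
                    found = sorted(aligned[candidate])
                    break
            if found:
                break
        result.append(found)
    return result


if __name__ == "__main__":
    # Target word 1 is unaligned and inherits from its right neighbour.
    print(affiliated_source_words(3, {(0, 0), (2, 2)}))
```

A decoder state would then carry these index lists for the leftmost and rightmost n − 1 target words of each constituent.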


4.1 Dependency-to-String Translation 


In this paper, we use a state-of-the-art dependency-to-string (Xie et al., 2011) decoder (Dep2Str),
which is also a hierarchical decoder. This dependency-to-string model employs rules that represent
the source side as head-dependents relations and the target side as strings. A head-dependents relation
(HDR) is composed of a head and all its dependents in dependency trees. Figure 3 shows a dependency
tree (a) with three HDRs (in shadow), an example of HDR rule (b) for the top level of (a), and an
example of head rule (c). HDR rules are constructed from head-dependents relations and can
act as both translation rules and reordering rules. Head rules are used for translating source words.

We adopt the decoder proposed by Meng et al. (2013) as a variant of Dep2Str translation that
is easier to implement with comparable performance. Basically, they extract the HDR rules with
the GHKM (Galley et al., 2004) algorithm. For the decoding procedure, given a source dependency tree
T, the decoder traverses T in post-order. The bottom-up chart-based decoding algorithm with cube
pruning (Chiang, 2007; Huang and Chiang, 2007) is used to find the k-best items for each node.





















4.2 MT Decoder 


Following Och and Ney (2002), we use a general log-linear framework. Let d be a derivation that
converts a source dependency tree into a target string e. The probability of d is defined as:

$$P(d) \propto \prod_i \phi_i(d)^{\lambda_i} \qquad (2)$$

where $\phi_i$ are features defined on derivations and $\lambda_i$ are the corresponding weights. Our
decoder contains the following features:

Baseline Features:

• translation probabilities $P(t|s)$ and $P(s|t)$ of HDR rules;

• lexical translation probabilities $P_{LEX}(t|s)$ and $P_{LEX}(s|t)$ of HDR rules;

• rule penalty $\exp(-1)$;

• pseudo translation rule penalty $\exp(-1)$;

• target word penalty $\exp(|e|)$;

• n-gram language model $P_{LM}(e)$;

Proposed Features:

• n-gram tagCNN joint language model $P_{TLM}(e)$;

• n-gram inCNN joint language model $P_{ILM}(e)$.

Our baseline decoder contains the first eight features. The pseudo translation rule (constructed
according to the word order of an HDR) ensures a complete translation when no matched rule is
found during decoding. The weights of all these features are tuned via minimum error rate training
(MERT) (Och, 2003). For the dependency-to-string decoder, we set rule-threshold and stack-threshold
to 10“^, rule-limit to 100, stack-limit to 200. 
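In practice a log-linear model of this kind is usually scored in log space, where the product of weighted features becomes a weighted sum of log feature values. The sketch below shows that computation for one derivation; the feature names and values are made up for the example and are not taken from the paper.

```python
import math


def loglinear_score(features, weights):
    """Unnormalized log-score of a derivation.

    features : feature name -> positive feature value for the derivation
    weights  : feature name -> tuned weight (for instance from MERT)
    returns  : the sum over features of weight * log(feature value)
    """
    return sum(weights[name] * math.log(value)
               for name, value in features.items())


if __name__ == "__main__":
    # Hypothetical feature values for a single derivation.
    features = {"lm": math.exp(-2.0), "rule_penalty": math.exp(-1.0)}
    weights = {"lm": 0.5, "rule_penalty": 1.0}
    print(loglinear_score(features, weights))
```

Adding a joint-model feature then only means adding one more entry to both dictionaries, which is why a purely lexicalized model is easy to plug into an existing decoder.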


5 Experiments 

The experiments in this Section are designed to answer the following questions: 

1. Are our tagCNN and inCNN joint language models able to improve translation quality, and are
they complementary to each other?

2. Do inCNN and tagCNN benefit from their guiding signal, compared to a generic CNN?

3. For tagCNN, is it helpful to embed more dependency structure, e.g., dependency head of each
affiliated word, as additional information?

4. Can our gating strategy improve the performance over max-pooling?

5.1 Setup 

Data: Our training data are extracted from LDC data². We only keep the sentence pairs whose
source side is no longer than 40 words, which covers over 90% of the sentences. The bilingual
training data consist of 221K sentence pairs, containing 5.0 million Chinese words and 6.8 million
English words. The development set is NIST MT03 (795 sentences) and test sets are MT04 (1499
sentences) and MT05 (917 sentences) after filtering with length limit.

²The corpora include LDC2002E18, LDC2003E07, LDC2003E14, LDC2004T07, LDC2005T06.







Preprocessing: The word alignments are obtained with GIZA++ (Och and Ney, 2003) on the
corpora in both directions, using the “grow-diag-final-and” balance strategy (Koehn et al., 2003). We
adopt the SRI Language Modeling Toolkit (Stolcke and others, 2002) to train a 4-gram language model
with modified Kneser-Ney smoothing on the Xinhua portion of the English Gigaword corpus (306
million words). We parse the Chinese sentences with Stanford Parser into projective dependency
trees.


Optimization of NN: In training the neural network, we limit the source and target vocabulary to
the most frequent 20K words for both Chinese and English, covering approximately 97% and 99%
of the two corpora respectively. All the out-of-vocabulary words are mapped to a special token unk. We
used stochastic gradient descent to train the joint model, setting the size of minibatch to 500. All joint
models used a 3-word target history (i.e., 4-gram LM). The dimension of word embedding and the
attention signal $h(\{e\}_{n-k}^{n-1})$ for inCNN are 100. For the convolution layers (Layer 1 and
Layer 3), we apply 100 filters. The final representation of CNN encoders is a vector with dimension 100.
The final DNN layer of our joint model is the standard multi-layer perceptron with softmax at the top layer.
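The vocabulary restriction can be sketched with the standard library. The function names are illustrative, and the tie-breaking among equally frequent words follows `Counter.most_common`, which is an implementation detail of this sketch rather than something specified in the text.

```python
from collections import Counter


def build_vocab(sentences, max_size=20000):
    """Keep the max_size most frequent words of a tokenized corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(max_size)}


def map_oov(sentence, vocab, unk="unk"):
    """Replace every out-of-vocabulary word with the special token."""
    return [word if word in vocab else unk for word in sentence]


if __name__ == "__main__":
    corpus = [["a", "b", "a"], ["a", "c", "b"]]
    vocab = build_vocab(corpus, max_size=2)
    print(map_oov(["a", "c"], vocab))
```

The same two steps would be applied separately to the Chinese and the English side of the parallel corpus.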


Metric: We use the case-insensitive 4-gram NIST BLEU³ as our evaluation metric, with statistical
significance tested with the sign-test (Collins et al., 2005) between the proposed models and two baselines.
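For readers unfamiliar with the sign test, the following sketch computes an exact two-sided p-value from paired comparison counts. How wins and losses are counted for translation output is not specified in this section, so the inputs here are hypothetical counts of test items on which one system beats the other, with ties discarded beforehand.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test under the null of equally likely outcomes."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome in n fair trials.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    print(sign_test_p_value(10, 0))
    print(sign_test_p_value(5, 5))
```

A result is then called significant at the level used in this paper when the p-value falls below 0.01.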


5.2 Setting for Model Comparisons 

We use the tagCNN and inCNN joint language models as additional decoding features to a
dependency-to-string baseline system (Dep2Str), and compare them to the neural network joint model
(NNJM) of Devlin et al. (2014), which we call BBN-JM hereafter. Although the BBN-JM in
(Devlin et al., 2014) was originally tested in hierarchical phrase-based (Chiang, 2007) SMT and
string-to-dependency (Shen et al., 2008) SMT, it is fairly versatile and can be readily integrated
into Dep2Str.


5.3 The Main Results 

The main results of different models are given in Table 1. Before proceeding to more detailed
comparison, we first observe that

• the baseline Dep2Str system gives BLEU 0.5+ higher than the open-source phrase-based system
Moses (Koehn et al., 2007);

• BBN-JM can give about +0.92 BLEU score over Dep2Str, a result similar to that reported in (Devlin
et al., 2014).

Clearly from Table 1, tagCNN and inCNN improve upon the Dep2Str baseline by +1.28 and +1.75
BLEU, outperforming BBN-JM in the same setting by +0.36 and +0.83 BLEU respectively, averaged
on NIST MT04 and MT05. These results indicate that tagCNN and inCNN can individually provide
discriminative information in decoding. It is worth noting that inCNN appears to be more informative
than the affiliated words suggested by the word alignment (GIZA++). We conjecture that this is due
to the following two facts:


³ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
⁴http://nlg.isi.edu/software/nplm/
























Systems                           MT04      MT05      Average
Moses                             34.33     31.75     33.04
Dep2Str                           34.89     32.24     33.57
+ BBN-JM (Devlin et al., 2014)    36.11     32.86     34.49
+ CNN (generic)                   36.12*    33.07*    34.60
+ tagCNN                          36.33*    33.37*    34.85
+ inCNN                           36.92*    33.72*    35.32
+ tagCNN + inCNN                  36.94*    34.20*    35.57

Table 1: BLEU-4 scores (%) on NIST MT04-test and MT05-test, of Moses (default settings),
dependency-to-string baseline system (Dep2Str), and different features on top of Dep2Str: neural
network joint model (BBN-JM), generic CNN, tagCNN, inCNN and the combination of tagCNN
and inCNN. The boldface numbers and superscript * indicate that the results are significantly better
(p<0.01) than those of the BBN-JM and the Dep2Str baseline respectively. “+” stands for adding the
corresponding feature to Dep2Str.


• inCNN avoids the propagation of mistakes and artifacts in the already learned word alignment;

• the guiding signal in inCNN provides complementary information to evaluate the translation.

Moreover, when tagCNN and inCNN are both used in decoding, the winning margin over BBN-JM
further increases to +1.08 BLEU points (in the last row of Table 1), indicating that the two models
with different guiding signals are complementary to each other.
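The gains quoted in this section can be re-derived from the Average column of Table 1. The short script below does that arithmetic; the dictionary simply copies the reported averages, and the helper name `gain` is introduced here for illustration.

```python
# Average BLEU-4 (%) over MT04 and MT05, copied from Table 1.
average_bleu = {
    "Dep2Str": 33.57,
    "BBN-JM": 34.49,
    "CNN (generic)": 34.60,
    "tagCNN": 34.85,
    "inCNN": 35.32,
    "tagCNN + inCNN": 35.57,
}


def gain(system, reference):
    """Difference of reported averages, rounded to two decimals."""
    return round(average_bleu[system] - average_bleu[reference], 2)


for system, reference in [
    ("BBN-JM", "Dep2Str"),
    ("tagCNN", "Dep2Str"),
    ("inCNN", "Dep2Str"),
    ("tagCNN + inCNN", "BBN-JM"),
    ("tagCNN + inCNN", "Dep2Str"),
]:
    print(f"{system} vs {reference}: {gain(system, reference):+.2f}")
```

Note that the differences are taken between the rounded averages as printed in the table, which is how the quoted figures line up.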

The Role of Guiding Signal It is slightly surprising that the generic CNN can also achieve a gain
on BLEU similar to that of BBN-JM, since intuitively the generic CNN encodes the entire sentence and
its representation should in general be far from optimal for the joint language model. The
reason, as we conjecture, is that the CNN yields a fairly informative summarization of the sentence
(thanks to its sophisticated convolution and gating architecture), which makes up for some of its loss
in resolution and in focus on the relevant parts of the source sentence. That said, the guiding signals
in both tagCNN and inCNN are crucial to the power of the CNN-based encoder, as can be easily seen from
the difference between the BLEU scores achieved by generic CNN, tagCNN, and inCNN. Indeed, with the
signal from the already learned word alignment, tagCNN can gain +0.25 BLEU over its generic
counterpart, while for inCNN, with the guiding signal from the preceding words in target, the gain is a
more salient +0.72 BLEU.


5.4 Dependency Head in tagCNN 

In this section, we study whether tagCNN can further benefit from encoding richer dependency
structure of the source language in the input. More specifically, the dependency head words can be
used to further improve the tagCNN model. As described in Section 3.2, in tagCNN we append a tagging
bit (0 or 1) to the embedding of words in the input layer as tags on whether they are affiliated source
words. To incorporate dependency head information, we extend the tagging rule in Section 3.2 to
add another tagging bit (0 or 1) to the word embedding for the original tagCNN to indicate whether it is
part of the dependency heads of the affiliated words. For example, if $\mathbf{x}_i$ is the embedding of
an affiliated source word and $\mathbf{x}_j$ the dependency head of word $\mathbf{x}_i$, the extended
input of tagCNN would contain

$$\mathbf{x}_i^{(\mathrm{AFF,\,NON\text{-}HEAD})} = [\mathbf{x}_i^\top \; 1 \; 0]^\top$$

$$\mathbf{x}_j^{(\mathrm{NON\text{-}AFF,\,HEAD})} = [\mathbf{x}_j^\top \; 0 \; 1]^\top$$










Systems          MT04      MT05      Average
Dep2Str          34.89     32.24     33.57
+tagCNN          36.33     33.37     34.85
+tagCNN_dep      36.54     33.61     35.08

Table 2: BLEU-4 scores (%) of tagCNN model with dependency head words as additional tags
(tagCNN_dep).


Systems              MT04      MT05      Average
Dep2Str              34.89     32.24     33.57
+inCNN               36.92     33.72     35.32
+inCNN-2-pooling     36.33     32.88     34.61
+inCNN-4-pooling     36.46     33.01     34.74
+inCNN-8-pooling     36.57     33.39     34.98

Table 3: BLEU-4 scores (%) of inCNN models implemented with gating strategy and k max-pooling,
where k is of {2, 4, 8}.


If the affiliated source word is the root of a sentence, we only append 0 as the second tagging bit since
the root has no dependency head. From Table 2, with the help of dependency head information, we
can improve tagCNN by +0.23 BLEU points on average over the two test sets.

5.5 Gating Vs. Max-pooling 

In this section, we investigate to what extent our gating strategy can improve the translation
performance over max-pooling, with the comparisons on the inCNN model as a case study. For the
implementation of inCNN with max-pooling, we replace the local gating (Layer-2) with max-pooling with
size 2 (2-pooling for short), and global gating (Layer-4) with k max-pooling (“k-pooling”), where k
is of {2, 4, 8}. Then, we use the mean of the outputs of k-pooling as the final input of Layer-5. In
doing so, we can guarantee that the input dimension of Layer-5 is the same as in the architecture with
gating. From Table 3, we can clearly see that our gating strategy can improve translation performance
over max-pooling by 0.34~0.71 BLEU points. Moreover, we find that 8-pooling yields better performance
than 2-pooling. We conjecture that this is because the useful relevant parts for translation are mainly
concentrated on a few words of the source sentence, which can be better extracted with a larger pool
size.
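The max-pooling baseline for Layer-4 can be sketched as follows. The sketch assumes that k max-pooling is applied per feature across all locations and that the k retained values are then averaged, which keeps the output dimension equal to the number of features; the text does not spell out this detail, so treat it as one plausible reading. The function name is hypothetical.

```python
def k_max_pool_mean(feature_maps, k):
    """Average the k largest values of each feature across locations.

    feature_maps : list over locations, each an F-dimensional vector
    returns      : one F-dimensional vector
    """
    dim = len(feature_maps[0])
    pooled = []
    for d in range(dim):
        top_k = sorted((fmap[d] for fmap in feature_maps), reverse=True)[:k]
        pooled.append(sum(top_k) / len(top_k))
    return pooled


if __name__ == "__main__":
    maps = [[1.0, 0.0], [3.0, 4.0], [2.0, 8.0]]
    print(k_max_pool_mean(maps, k=2))
```

With k = 1 this reduces to ordinary max-pooling over locations, and a larger k lets more locations contribute to the final representation.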


6 Related Work 


The seminal work on the neural network language model (NNLM) can be traced to Bengio et al. (2003)
on monolingual text. It was recently extended by Devlin et al. (2014) to include additional source
context (11 source words) in modeling the target sentence, which is clearly most related to our work,
with however two important differences: 1) instead of the ad hoc way of selecting a context window
in (Devlin et al., 2014), our model covers the entire source sentence and automatically distills the
context relevant for target modeling; 2) our convolutional architecture can effectively leverage guiding
signals of vastly different forms and nature from the target.

Prior to our model there is also work on representing source sentences with neural networks,
including RNN (Cho et al., 2014; Sutskever et al., 2014) and CNN (Kalchbrenner and Blunsom,
2013). These works typically aim to map the entire sentence to a vector, which will be used later by
an RNN/LSTM-based decoder to generate the target sentence. As demonstrated in Section 5, the
representation learnt this way cannot pinpoint the relevant parts of the source sentences (e.g., at the
word or phrase level) and is therefore less suited to direct integration into traditional SMT decoders.


Our model, especially inCNN, is inspired by the automatic alignment model proposed in (Bahdanau
et al., 2014). As the first effort to apply an attention model to machine translation, it sends the state
of a decoding RNN as attentional signal to the source end to obtain a weighted sum of embeddings of
source words as the summary of relevant context. In contrast, inCNN uses 1) a different attention
signal extracted from preceding words in partial translations, and 2) more importantly, a convolutional
architecture and therefore a highly nonlinear way to retrieve and summarize the relevant information
in source.


7 Conclusion and Future Work 

We proposed convolutional architectures for obtaining a guided representation of the entire source
sentence, which can be used to augment the n-gram target language model. With different guiding
signals from the target side, we devise tagCNN and inCNN, both of which are tested in enhancing a
dependency-to-string SMT system, with +2.0 BLEU points over the baseline and +1.08 BLEU points over
the state-of-the-art in (Devlin et al., 2014). For future work, we will consider encoding more complex
linguistic structures to further enhance the joint model.


References 

[Auli et al.2013] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint language and
translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, pages 1044-1054, Seattle, Washington, USA, October.

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural
probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

[Chiang2007] David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics,
33(2):201-228.

[Cho et al.2014] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder
for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 1724-1734, Doha, Qatar, October.

[Collins et al.2005] Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for
statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, pages 531-540.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John
Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1370-1380, Baltimore, Maryland, June.

[Galley et al.2004] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What’s in a
translation rule. In Proceedings of HLT/NAACL, volume 4, pages 273-280. Boston.

[Hu et al.2014] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network
architectures for matching natural language sentences. In NIPS.







[Huang and Chiang2007] Liang Huang and David Chiang. 2007. Forest rescoring: Faster decoding with
integrated language models. In Annual Meeting-Association For Computational Linguistics, volume 45, pages
144-151.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation
models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pages 1700-1709, Seattle, Washington, USA, October.

[Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional
neural network for modelling sentences. ACL.

[Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based
translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology-Volume 1, pages 48-54.

[Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar,
Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine
translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June.

[LeCun et al.1998] Y. LeCun, L. Bottou, G. Orr, and K. Muller. 1998. Efficient backprop. In Neural Networks:
Tricks of the trade. Springer.

[Meng et al.2013] Fandong Meng, Jun Xie, Linfeng Song, Yajuan Lü, and Qun Liu. 2013. Translation with
source constituency and dependency trees. In Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing, pages 1066-1076, Seattle, Washington, USA, October.

[Och and Ney2002] Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy
models for statistical machine translation. In Proceedings of the 40th Annual Meeting on Association for
Computational Linguistics, pages 295-302.

[Och and Ney2003] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical
alignment models. Computational Linguistics, 29(1):19-51.

[Och2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In
Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 160-167.

[Shen et al.2008] Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine
translation algorithm with a target dependency language model. In Proceedings of ACL-08: HLT, pages
577-585.

[Stolcke and others2002] Andreas Stolcke et al. 2002. SRILM - an extensible language modeling toolkit. In
Proceedings of the international conference on spoken language processing, volume 2, pages 901-904.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with
neural networks. CoRR, abs/1409.3215.

[Xie et al.2011] Jun Xie, Haitao Mi, and Qun Liu. 2011. A novel dependency-to-string model for statistical
machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 216-226.



